SELF-TESTING AND -REPAIRING






BACKGROUND



1. FIELD OF THE INVENTION 



This invention relates generally to robustness (resistance to
failure) of computer systems; and more particularly to novel
apparatus and methods for shielding and preserving computer
systems (which can be substantially conventional systems)
from failure.

2. RELATED ART



(a) Earlier publications: Listed below, and wholly
incorporated by reference into the present document, are
earlier materials in this field that will be helpful in
orienting the reader. Cross-references to these publications,
by number in the following list, appear enclosed in square
brackets in the present document:

[1] Intel Corp., Intel's Quality System Databook (January
1998), Order No. 210997-007.

[2] A. Avizienis and Y. He, "Microprocessor entomology: A
taxonomy of design faults in COTS microprocessors", in J.
Rushby and C. B. Weinstock, editors, Dependable Computing for
Critical Applications 7, IEEE Computer Society Press (1999).

[3] A. Avizienis and J. P. J. Kelly, "Fault tolerance by
design diversity: concepts and experiments", Computer,
17(8): 67-80 (August 1984).

[4] A. Avizienis, "The N-version approach to fault-tolerant
software", IEEE Trans. Software Eng., SE-11(12): 1491-1501
(December 1985).

[5] M. K. Joseph and A. Avizienis, "Software fault tolerance
and computer security: A shared problem", in Proc. of the
Annual National Joint Conference and Tutorial on Software
Quality and Reliability, pages 428-36 (March 1989).

A. Avizienis, Ph. D. / xAAA-02 / P. Lippman / June 20, 2001

[6] Y. He, An Investigation of Commercial Off-the-Shelf (COTS)
Based Fault Tolerance, PhD thesis, Computer Science
Department, University of California, Los Angeles (September
1999).

[7] Y. He and A. Avizienis, "Assessment of the applicability
of COTS microprocessors in high-confidence computing systems:
A case study", in Proceedings of ICDSN 2000 (June 2000).

[8] Intel Corp., The Pentium II Xeon Processor Server Platform
System Management Guide (June 1998), Order No. 243835-001.

[9] A. Avizienis, G. C. Gilley, F. P. Mathur, D. A. Rennels,
J. A. Rohr, and D. K. Rubin, "The STAR (Self-Testing-and-
Repairing) computer: An investigation of the theory and
practice of fault-tolerant computer design", IEEE Trans.
Comp., C-20(11): 1312-21 (November 1971).

[10] T. B. Smith, "Fault-tolerant clocking system", in Digest
of FTCS-11, pages 262-64 (June 1981).

[11] Intel Corp., P6 Family of Processors Hardware Developer's
Manual (September 1998), Order No. 244001-001.

[12] A. Avizienis, "Toward systematic design of fault-tolerant
systems", Computer, 30(4): 51-58 (April 1997).

[13] "Special report: Sending astronauts to Mars", Scientific
American, 282(3): 40-63 (March 2000).

[14] NASA, "Conference on enabling technology and required
scientific developments for interstellar missions", OSS
Advanced Concepts Newsletter, page 3 (March 1999).



(b) Failure of computer systems — The purpose of a 
computer system is to deliver information processing services 
according to a specification. Such a system is said to "fail"
when the service that it delivers stops or when it becomes 
incorrect, that is, it deviates from the specified service. 

There are five major causes of system failure ("F"):

(F1) permanent physical failures (changes) of its hardware
     components [1];

(F2) interference with the operation of the system by external
     environmental factors, such as cosmic rays, electromagnetic
     radiation, excessive temperature, etc.;

(F3) previously undetected design faults (also called "bugs",
     "errata", etc.) in the hardware and software components
     of a computer system that manifest themselves during
     operation [2-4];

(F4) malicious actions by humans that cause the cessation or
     alteration of correct service: the introduction of
     computer "viruses", "worms", and other kinds of software
     that maliciously affects system operation [5]; and

(F5) unintentional mistakes by human operators or maintenance
     personnel that lead to the loss or undesirable changes of
     system service.

Commercial-off-the-shelf ("COTS") hardware components
(memories, microprocessors, etc.) for computer systems have a
low probability of failure due to failure mode F1 above [1].
They contain, however, very limited protection, or none at
all, against causes F2 through F5 listed above [6, 7].

Accordingly the related art remains subject to major
problems, and the efforts outlined in the cited publications,
though praiseworthy, have left room for considerable
refinement.

SUMMARY OF THE DISCLOSURE

The present invention introduces such refinement. In its
preferred embodiments, the present invention has several
aspects or facets that can be used independently, although they
are preferably employed together to optimize their benefits.

In preferred embodiments of its first major independent
facet or aspect, the invention is apparatus for deterring
failure of a computing system. (The term "deterring" implies
that the computing system is rendered less probable to fail,
but there is no absolute prevention or guarantee.) The
apparatus includes an exclusively hardware network of
components, having substantially no software.

The apparatus also includes terminals of the network for 
connection to the system. In certain of the appended claims, 
this relationship is described as "connection to such system". 

(In the accompanying claims generally the term "such" is 
used, instead of "said" or "the", in the bodies of the claims, 
when reciting elements of the claimed invention, for referring
back to features which are introduced in preamble as part of 
the context or environment of the claimed invention. The pur- 
pose of this convention is to aid in more distinctly and em- 
phatically pointing out which features are elements of the 
claimed invention, and which are parts of its context — and 
thereby to more particularly claim the invention.)

The apparatus includes fabrication-preprogrammed hardware
circuits of the network for guarding the system from failure.
For purposes of this document, the term "fabrication-
preprogrammed hardware circuit" means an application-specific
integrated circuit (ASIC) or equivalent.

This terminology accordingly encompasses two main types
of hardware:

(1) a classical ASIC: a unitary, special-purpose processor
    circuit, sometimes called a "sequencer", fabricated in
    such a way that it substantially can perform only one
    program (though the program can be extremely complex,
    with many conditional branches and loops etc.); and

(2) a general-purpose processor interlinked with a true
    read-only memory (ROM), "true read-only" in the sense
    that the memory circuit and its contents substantially
    cannot be changed without destroying it, the memory
    circuit being fabricated in such a way that it contains
    only one program (again, potentially quite complicated),
    which the processor performs.

Ordinarily either of these device types when powered up
starts to execute its program, which in essence is unalterably
preprogrammed into the device at the time of manufacture.
The program in the second type of device configuration
identified above, in which the processor reads out the program
from an identifiably separate memory, is sometimes termed
"firmware"; however, when a true ROM is used, the distinction
between firmware and ASIC is strongly blurred.

The term "fabrication-preprogrammed hardware circuit" also
encompasses all other kinds of circuits (including optical)
that follow a program which is substantially permanently
manufactured in. In particular this nomenclature explicitly
encompasses any device so described, whether or not in
existence at the time of this writing.

The foregoing may represent a description or definition
of the first aspect or facet of the invention in its broadest
or most general form. Even as couched in these broad terms,
however, it can be seen that this facet of the invention
importantly advances the art.

In particular, through use of a protective system that is
itself all hardware the probability of failure by previously
mentioned failure causes (F1), (F2), (F4) and (F5) in the
protective system itself is very greatly reduced. Furthermore
the probability of failure by cause (F3) is rendered
controllable by use of extremely simple hardware designs that
can be qualified quite completely. While these considerations
alone cannot eliminate the possibility of failure in the
guarded computing system, they represent an extremely
important advance in that at least the protective system
itself is very likely to be available to continue its
protective efforts.

Although the first major aspect of the invention thus 
significantly advances the art, nevertheless to optimize 
enjoyment of its benefits preferably the invention is prac- 
ticed in conjunction with certain additional features or 
characteristics. In particular, if the computing system is 
substantially exclusively made up of substantially commercial, 
off-the-shelf components, preferably at least one of the net- 
work terminals is connected to receive at least one error sig- 
nal generated by the computing system in event of incipient 
failure of that system; and at least one of the network termi- 
nals is connected to provide at least one recovery signal to 
the system upon receipt of the error signal. 

If that preference is observed, then a subsidiary
preference arises: preferably the circuits include portions
that are fabrication-preprogrammed to evaluate the "at least
one" error signal to establish characteristics of the at least
one recovery signal. In other words, these circuits select or
fashion the recovery signal in view of the character of the
error signal.
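
As a minimal sketch of how such circuits might select a recovery
signal in view of the character of the error signal, the following
Python fragment models the fabrication-preprogrammed evaluation as a
fixed, read-only lookup table. The names (`ErrorSignal`,
`RecoverySignal`, their members, and `select_recovery`) and the
particular mapping are hypothetical illustrations, not taken from this
disclosure; in the apparatus itself this logic would reside in
hardware rather than software.

```python
from enum import Enum, auto
from types import MappingProxyType


class ErrorSignal(Enum):
    # Hypothetical error classes a guarded system might report.
    PARITY_ERROR = auto()
    WATCHDOG_TIMEOUT = auto()
    OVERTEMPERATURE = auto()


class RecoverySignal(Enum):
    # Hypothetical recovery commands the protective network might issue.
    RETRY = auto()
    RESET = auto()
    POWER_DOWN = auto()


# Read-only table standing in for logic fixed at fabrication time.
RECOVERY_TABLE = MappingProxyType({
    ErrorSignal.PARITY_ERROR: RecoverySignal.RETRY,
    ErrorSignal.WATCHDOG_TIMEOUT: RecoverySignal.RESET,
    ErrorSignal.OVERTEMPERATURE: RecoverySignal.POWER_DOWN,
})


def select_recovery(error: ErrorSignal) -> RecoverySignal:
    """Return the recovery signal matched to the character of the error."""
    return RECOVERY_TABLE[error]
```

The read-only mapping is meant only to echo the idea that the
evaluation is preprogrammed and not alterable in the field.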


For the first aspect of the invention introduced above,
as noted already, the computing system as most broadly
conceived is not a part of the invention but rather is an
element of the context or environment of that invention. For a
variant form of the first aspect of the invention, however, the
protected computing system is a part of an inventive
combination that includes the first aspect of the invention as
broadly defined.

This dual character is common to all the other aspects
discussed below, and also to the various preferences stated
for those other aspects: in each case a variant form of the
invention includes the guarded computing system. In addition,
as also mentioned above, a particularly valuable set of
preferences for the first aspect of the invention consists of
combinations of that aspect with all the other aspects.

These combinations include crosscombinations of the first
aspect with each of the others in turn, but also include
combinations of three aspects, four and so on. Thus the most
highly preferred form of the invention accordingly uses all of
its inventive aspects.

In preferred embodiments of its second major independent 
facet or aspect, the invention is apparatus for deterring fai- 
lure of a computing system. The apparatus includes a network 
of components having terminals for connection to the system, 
and circuits of the network for operating programs to guard 
the system from failure. 

The circuits in preferred embodiments of the second facet 
of the invention also include portions for identifying failure 
of any of the circuits and correcting for the identified fai- 
lure. (The "circuits" whose failure is identified and correc- 
ted for — in this second aspect of the invention — are the 
circuits of the network apparatus itself, not of the computing 
system.)

For the purposes of this document, the phrase "circuits
. . . for operating programs" means either a fabrication-
preprogrammed hardware circuit, as described above, or a
firmware- or even software-driven circuit, or hybrids of these
types. As noted earlier, all-hardware circuitry is strongly
preferred for practice of the invention; however, the main
aspects other than the first one do not expressly require such
construction.

The foregoing may represent a description or definition 
of the second aspect or facet of the invention in its broadest 
or most general form. Even as couched in these broad terms, 
however, it can be seen that this facet of the invention 
importantly advances the art. 

In particular, as in the case of the first aspect of the
invention, the benefits of this second aspect reside in the
extremely high reliability of the protective apparatus.
Whereas the first aspect focuses upon benefits derived from
the structural character, as such, of that apparatus, this
second aspect concentrates on benefits that flow from
self-monitoring and correction on the part of that apparatus.

Although the second major aspect of the invention thus
significantly advances the art, nevertheless to optimize
enjoyment of its benefits preferably the invention is practiced
in conjunction with certain additional features or
characteristics. In particular, preferably the
program-operating portions include a section that corrects for
the identified failure by taking a failed circuit out of
operation.

In event this basic preference is followed, a subpref- 
erence is that the program-operating portions include a 
section that substitutes and powers up a spare circuit for a 
circuit taken out of operation. Another basic preference is 
that the program-operating portions include at least three of 
the circuits; and that failure be identified at least in part 
by majority vote among the at least three circuits. 
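
The voting and sparing preferences just described can be sketched as
follows. This is only an illustrative Python model under stated
assumptions: circuit outputs are represented as hashable values, the
function names `find_dissenters` and `replace_failed` and the circuit
identifiers are hypothetical, and "powering up" a spare is reduced to
moving an identifier from one list to another.

```python
from collections import Counter


def find_dissenters(outputs):
    """Return (majority_value, indices of circuits that disagree with it).

    Raises ValueError if fewer than three circuits vote or if no strict
    majority exists among their outputs.
    """
    if len(outputs) < 3:
        raise ValueError("majority voting needs at least three circuits")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no strict majority among circuit outputs")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


def replace_failed(active, spares, failed_index):
    """Take the failed circuit out of operation and bring in a spare, if any.

    `active` and `spares` are lists of hypothetical circuit identifiers.
    Returns the new active list; the input lists are not modified.
    """
    remaining = [c for i, c in enumerate(active) if i != failed_index]
    if spares:
        remaining.append(spares[0])  # stands in for powering up a spare
    return remaining
```

For example, with outputs `[7, 7, 9]` the third circuit is the lone
dissenter, and it would be the candidate for removal and replacement.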

The earlier-noted dual character of the invention — as 
having a variant that includes the computing system — applies 
to this second aspect of the invention as well as the first, 
and also to all the other aspects of the invention discussed 
below. Also applicable to this second facet and all the 
others is the preferability of employing all the facets
together in combination with each other. 

In preferred embodiments of its third major independent
facet or aspect, the invention is apparatus for deterring
failure of a computing system that has at least one software
subsystem for conferring resistance to failure of the system;
the apparatus includes a network of components having
terminals for connection to the system; and circuits of the
network for operating programs to guard the system from
failure.

The circuits include substantially no portion that
interferes with the failure-resistance software subsystem. The
foregoing may represent a description or definition of the
third aspect or facet of the invention in its broadest or most
general form. Even as couched in these broad terms, however,
it can be seen that this facet of the invention importantly
advances the art.

In particular, operation of this aspect of the invention
advantageously refrains from tampering with protective
features built into the guarded system itself. The invention
thus takes forward steps toward ever-higher reliability
without inflicting on the protected system any backward steps
that actually reduce reliability.

Although the third major aspect of the invention thus
significantly advances the art, nevertheless to optimize
enjoyment of its benefits preferably the invention is practiced
in conjunction with certain additional features or
characteristics. In particular, as before, a preferred variant
of the invention includes the protected computing system, here
particularly including the at least one software subsystem.

In preferred embodiments of its fourth major independent 
facet or aspect, the invention is apparatus for deterring fai- 
lure of a computing system that is substantially exclusively 
made of substantially commercial, off-the-shelf components and 
that has at least one hardware subsystem for generating a re- 
sponse of the system to failure. The apparatus includes a 
network of components having terminals for connection to the 
system; and circuits of the network for operating programs to 
guard the system from failure. 

The circuits include portions for reacting to the re- 
sponse of the hardware subsystem. (In the "Detailed Descrip- 
tion" section that follows, these portions may be identified 
as the so-called "M-nodes" and some instances of "D-nodes".)

The foregoing may represent a description or definition
of the fourth aspect or facet of the invention in its broadest
or most general form. Even as couched in these broad terms,
however, it can be seen that this facet of the invention
importantly advances the art.

In particular, this facet of the invention exploits the
hardware provisions of the protected computing system (i.e.,
the most reliable portions of that system) to establish when
the protected system is actually in need of active aid. In
earlier systems the only effort to intercede in response to
such need was provided from the computing system itself; and
that system, in event of need, was already compromised.

Although the fourth major aspect of the invention thus 
significantly advances the art, nevertheless to optimize 
enjoyment of its benefits preferably the invention is prac- 
ticed in conjunction with certain additional features or 
characteristics. In particular, preferably the reacting
portions include sections for evaluating the hardware-
subsystem response to establish characteristics of at least
one recovery signal. When this basic preference is observed, a
subpreference is that the reacting portions include sections
for applying the at least one recovery signal to the system.



In preferred embodiments of its fifth major independent
facet or aspect, the invention is apparatus for deterring
failure of a computing system that is distinct from the
apparatus and that has plural generally parallel computing
channels. The apparatus includes a network of components
having terminals for connection to the system; and circuits of
the network for operating programs to guard the system from
failure.

The circuits include portions for comparing computational 
results from the parallel channels. (In the "Detailed
Description" section that follows, these portions may be
identified as the so-called "D-nodes".)

The foregoing may represent a description or definition 
of the fifth aspect or facet of the invention in its broadest 
or most general form. Even as couched in these broad terms, 
however, it can be seen that this facet of the invention im- 
portantly advances the art. 

In particular, this facet of the invention takes favora- 
ble advantage of redundant processing within the protected 
computing system, actually applying a reliable, objective ex- 
ternal comparison of outputs from the two or more internal 
channels. The result is a far higher degree of confidence in 
the overall output. 

Although the fifth major aspect of the invention thus
significantly advances the art, nevertheless to optimize
enjoyment of its benefits preferably the invention is practiced
in conjunction with certain additional features or
characteristics. In particular, preferably the parallel
channels of the computing system are of diverse design or
origin; when outputs from parallel processing within
architecturally and even commercially diverse subsystems are
objectively in agreement, the outputs are very reliable indeed.

Another basic preference is that the comparing portions 
include at least one section for analyzing discrepancies be- 
tween the results from the parallel channels. If this pref- 
erence is in effect, then another subsidiary preference is 
that the comparing portions further include at least one sec- 
tion for imposing corrective action on the system in view of 
the analyzed discrepancies. In this case a still further
nested preference is that the at least one discrepancy-
analyzing section uses a majority voting criterion for
resolving discrepancies.

When the parallel channels of the computing system are of
diverse design or origin (a preferred condition, as noted
above) it is further preferable that the comparing portions
include circuitry for performing an algorithm to validate a
match that is inexact. This is preferable because certain
types of calculations performed by diverse plural systems are
likely to produce slightly divergent results, even when the
calculations in the plural channels are performed correctly.

In the case of such inexactness-permissive matching, a
number of alternative preferences come into play for
accommodating the type of calculation actually involved. One
is that the algorithm-performing circuitry preferably employs
a degree of inexactness suited to a type of computation under
comparison; an alternative is that the algorithm-performing
circuitry performs an algorithm which selects a degree of
inexactness based on type of computation under comparison.
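
The second alternative, selecting a degree of inexactness based on the
type of computation under comparison, might be sketched as below. The
tolerance values, the computation-type labels, and the function name
`results_match` are all assumptions introduced for illustration; the
disclosure does not specify them, and the actual comparison would be
performed by circuitry rather than by Python code.

```python
import math

# Hypothetical relative tolerances per type of computation; the values
# are illustrative assumptions only.
TOLERANCE_BY_TYPE = {
    "integer": 0.0,          # integer results must agree exactly
    "floating_point": 1e-9,  # rounding may differ across diverse designs
    "iterative": 1e-6,       # iterative solvers may stop at different points
}


def results_match(a, b, computation_type):
    """Decide whether two channel results agree within the tolerance
    selected for the given type of computation."""
    rel_tol = TOLERANCE_BY_TYPE[computation_type]
    if rel_tol == 0.0:
        return a == b
    return math.isclose(a, b, rel_tol=rel_tol)
```

A purely relative tolerance is used here for brevity; results expected
to lie near zero would also need an absolute tolerance.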

In preferred embodiments of its sixth major independent
facet or aspect, the invention is apparatus for deterring
failure of a computing system that has plural processors; the
apparatus includes a network of components having terminals
for connection to the system; and circuits of the network for
operating programs to guard the system from failure.

The circuits include portions for identifying failure of
any of the processors and correcting for identified failure.
(In the "Detailed Description" section that follows, these
portions may be identified as the so-called "M-nodes" and some
instances of "D-nodes".)

The foregoing may represent a description or definition 
of the sixth aspect or facet of the invention in its broadest 
or most general form. Even as couched in these broad terms, 
however, it can be seen that this facet of the invention im- 
portantly advances the art. 

In particular, whereas the fifth aspect of the invention 
advantageously addresses the functional results of parallel 
processing in the protected system, this sixth facet of the 
invention focuses upon the hardware integrity of the parallel 
processors. This focus is in terms of each processor
individually, as distinguished from the several processors
considered in the aggregate, and thus beneficially goes to a
level of verification not heretofore found in the art.

Although the sixth major aspect of the invention thus 
significantly advances the art, nevertheless to optimize en- 
joyment of its benefits preferably the invention is practiced 
in conjunction with certain additional features or character- 
istics. In particular, preferably the identifying portions 
include a section that corrects for the identified failure by 
taking a failed processor out of operation. 




When this basic preference is actualized, then a subpref- 
erence is applicable: preferably the section includes parts 
for taking a processor out of operation only in case of sig- 
nals indicating that the processor has failed permanently. 
Another basic preference is that the identifying portions in- 
clude a section that substitutes and powers up a spare circuit 
for a processor taken out of operation. 
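
One way to picture the subpreference of retiring a processor only when
its failure appears permanent is a simple persistence counter, sketched
below. The criterion used here (a run of consecutive error reports
reaching a threshold) is an assumption chosen for illustration, as are
the class name `ProcessorMonitor` and its attributes; the disclosure
states only that the signals indicate permanent failure, not how that
is determined.

```python
class ProcessorMonitor:
    """Tracks consecutive error reports for one processor.

    A processor is treated as permanently failed only after
    `threshold` consecutive errors; an error-free report resets
    the count, so isolated transient errors never retire it.
    """

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_errors = 0
        self.in_service = True

    def report(self, error_seen):
        """Record one observation; return True if the processor was
        taken out of operation as a result of this report."""
        if not self.in_service:
            return False
        if not error_seen:
            self.consecutive_errors = 0
            return False
        self.consecutive_errors += 1
        if self.consecutive_errors >= self.threshold:
            self.in_service = False
            return True
        return False
```

When `report` returns True, a caller modeling the further preference
would substitute and power up a spare for the retired processor.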



In preferred embodiments of its seventh major independent 
facet or aspect, the invention is apparatus for deterring fai- 
lure of a computing system. The apparatus includes a network 
of components having terminals for connection to the system; 
and circuits of the network for operating programs to guard 
the system from failure. 

The circuits include modules for collecting and respond- 
ing to data received from at least one of the terminals. The 
modules include at least three data-collecting and -responding 
modules, and also processing sections for conferring among the 
modules to determine whether any of the modules has failed. 

The foregoing may represent a description or definition
of the seventh aspect or facet of the invention in its
broadest or most general form. Even as couched in these broad
terms, however, it can be seen that this facet of the
invention importantly advances the art.

In particular, whereas the earlier-discussed fifth aspect
of the invention enhances reliability through comparison of
processing results among subsystems within the protected
computing system, this seventh facet of the invention looks to
comparison of modules in the protective apparatus itself, to
attain an analogous upward step in reliability of the hybrid
overall system.

Although the seventh major aspect of the invention thus 
significantly advances the art, nevertheless to optimize en- 
joyment of its benefits preferably the invention is practiced 
in conjunction with certain additional features or
characteristics. In particular, these preferences as mentioned
earlier include crosscombinations of the several facets or
aspects, and also the dual character of the invention, i.e.,
encompassing a variant overall combination which includes the
protected computing system.

In preferred embodiments of its eighth major independent
facet or aspect, the invention is apparatus for deterring fai- 
lure of a computing system. The latter system is substantial- 
ly exclusively made of substantially commercial, off-the-shelf 
components, and has at least one subsystem for generating a 
response of the system to failure — and also has at least one 
subsystem for receiving recovery commands.

The apparatus includes a network of components having 
terminals for connection to the system between the response- 
generating subsystem and the recovery-command-receiving sub- 
system. It also has circuits of the network for operating 
programs to guard the system from failure. 

The circuits include portions for interposing analysis 
and a corrective reaction between the response-generating sub- 
system and the command-receiving subsystem. The foregoing may 
represent a description or definition of the eighth aspect or 
facet of the invention in its broadest or most general form. 
Even as couched in these broad terms, however, it can be seen 
that this facet of the invention importantly advances the art. 

In particular, earlier fault-deterring efforts have
concentrated upon feeding back corrective reaction within the
protected system itself. Such prior attempts are flawed in
that generally commercial, off-the-shelf systems intrinsically
lack both the reliability and the analytical capability to
police their own failure modes.

Although the eighth major aspect of the invention thus 
significantly advances the art, nevertheless to optimize en- 
joyment of its benefits preferably the invention is practiced 
in conjunction with certain additional features or
characteristics. In particular, preferably the general
preferences mentioned above (e.g., as to the seventh facet)
are equally applicable here.



All of the foregoing operational principles and advantages
of the present invention will be more fully appreciated
upon consideration of the following detailed description, with
reference to the appended drawings, of which:


BRIEF DESCRIPTION OF THE DRAWINGS



Fig. 1 is a partial block diagram, very schematic, of a 
two-ring architecture used for preferred embodiments of the 
invention;

Fig. 2 is a like view, but expanded, of the inner ring 
including a group of components called the "M-cluster"; 

Fig. 3 is an electrical schematic of an n-bit comparator 
and switch used in preferred embodiments; 

Fig. 4 is a set of two like schematics: Fig. 4a showing
one "A-node" or "A-port" (namely the "a" half of a
self-checking A-pair "a" and "b"), and Fig. 4b showing
connections of A-nodes "a" and "b" with their C-node;

Fig. 5 is a like schematic showing one M-node (monitor 
node) from a five-node M-cluster; 

Fig. 6 is a view like Figs. 1 and 2, but showing the core 
of the M-cluster; 

Fig. 7 is a schematic like Figs. 3 through 5 but showing 
one self-checking S3-node (b-side blocks not shown) in a total 
set of four S3-nodes; 

Fig. 8 is a set of three flow diagrams: Fig. 8a showing
a power-on sequence for the M-cluster, controlled by S3-nodes,
Fig. 8b showing a power-on sequence for the outer ring (one
node), controlled by an M-cluster, and Fig. 8c showing a
power-off sequence for the invention;

Fig. 9 is a schematic like Figs. 3 through 5, and 7, but
showing one of a self-checking pair of D-nodes, namely node
"a" (the identical twin D-node "b" not shown); and

Fig. 10 is a block diagram, highly schematic, of a fault- 
tolerant chain of interstellar spacecraft embodying certain 
features of the invention. 

A key to symbols and callouts used in the drawings ap- 
pears at the end of this text, preceding the claims. 



DETAILED DESCRIPTION

OF THE PREFERRED EMBODIMENTS

1. SYSTEM ELEMENTS 



Preferred embodiments of the present invention provide a
so-called "fault-tolerance infrastructure" (FTI) that is a
system composed of four types of special-purpose controllers
which will be called "nodes". The nodes are ASICs
(application-specific integrated circuits) that are controlled
by hardwired sequencers or by microcode.

The preferred embodiments employ no software. The four
kinds of nodes will be called:

(1) A-nodes (adapter nodes);
(2) M-nodes (monitor nodes);
(3) D-nodes (decision nodes); and
(4) S3-nodes (startup, shutdown, and survival nodes).

The purpose of the FTI is to provide protection against
all five causes of system failure for a computing system that
can be substantially conventional and composed of COTS
components, called C-nodes (computing nodes). Merely for the
sake of simplicity, and tutorial clarity in emphasizing the
capabilities of the invention, this document generally refers
to the C-nodes as made up of COTS components, or as a "COTS
system"; however, it is to be understood that the invention is
not limited to protection of COTS systems and is equally
applicable to guarding custom systems.

The C-nodes are connected to the A-nodes and D-nodes of
the FTI in the manner described subsequently. The C-nodes can
be COTS microprocessors, memories, and components of the
supporting chipset in the COTS computer system that will be
called the "client system" or simply the "client".

The following protection for the client system is provi- 
ded when it is connected to the FTI. 

(1) The FTI provides error detection and recovery support 
when the client COTS system is affected by physical 
failures of its components (F1) and by external interference 
(F2). The FTI provides power switching for unpowered 
spare COTS components of the client system to replace 
failed COTS components (F1) in long-duration missions. 

(2) The FTI provides a "shutdown-hold-restart" recovery 
sequence for catastrophic events (F2, F3, F4) that affect 
either the client COTS system or both the COTS and FTI 
systems. Such events are: a "crash" of the client COTS 
system software, an intensive burst of radiation, temporary 
outage of client COTS system power, etc. 

(3) The FTI provides (by means of the D-nodes) the essential 
mechanisms to detect and to recover from the manifestations 
of software and hardware design faults (F3) in the 
client system. 

This is accomplished by the implementation of design 
diversity [3, 4]. Design diversity is the implementation 
of redundant channel computation (duplication with 
comparison, triplication with voting, etc.) in which each 
channel (i.e., C-node) employs independently designed 
hardware and software, while the D-node serves as the 
comparator or voter element. Design diversity also 
provides detection and neutralization of malicious software 
(F4) and of mistakes (F5) by operators or maintenance 
personnel [5]. 

Finally, the nodes and interconnections of the FTI are 
designed to provide protection for the FTI system itself as 
follows. 

(1) Error detection and recovery algorithms are incorporated 
to protect against causes (F1) and (F2). 

(2) The absence of software in the FTI provides immunity 
against causes (F4) and (F5). 

(3) The overall FTI design allows the introduction of diverse 
hardware designs for the A-, M-, S3-, and D-nodes in order 
to provide protection against cause (F3), i.e., hardware 
design faults. Such protection may prove not to be 
necessary, since the low complexity of the node structure 
should allow complete verification of the node designs. 

When interconnected in the manner described below, the 
FTI and the client COTS computing system form a 
high-performance computing system that is protected against 
all five system failure causes (F1)-(F5). For purposes of the 
present document this system will be called a "diversifiable 
self-testing and -repairing system" ("DiSTARS"). 

2. ARCHITECTURE OF DiSTARS 

(a) The DiSTARS Configuration. The structure of a 
preferred embodiment of DiSTARS conceptually consists of two 
concentric rings (Fig. 1): an Outer Ring and an Inner Ring. The 
Outer Ring contains the client COTS system, composed of 
Computing Nodes or C-nodes 11 (Fig. 1) and their System Bus 12. 

The C-nodes are either high-performance COTS processors 
(e.g., Pentium II) with associated memory, or other COTS 
elements from the supporting chipset (I/O controllers, etc.), 
and other subsystems of a server platform [8]. The Outer Ring 
is supplemented with custom-designed Decision Nodes or 
"D-nodes" 13 that communicate with the C-nodes via the System 
Bus 12. The D-nodes serve as comparators or voters for inputs 
provided by the C-nodes. They also provide the means for the 
C-nodes to communicate with the Inner Ring. Detailed 
discussion of the D-node is presented later. 

The Inner Ring is a custom-designed system composed of 
Adapter Nodes or "A-nodes" 14 and a cluster of Monitor Nodes, 
or "M-nodes", called the M-cluster 15. The A-nodes and the 
M-nodes communicate via the Monitor Bus or "M-bus" 16. Every 
A-node also has a dedicated A-line 17 for one-way 
communication to the M-nodes. The custom-designed D-nodes 13 
of the Outer Ring contain embedded A-ports 18 that serve the 
same purpose as the external A-nodes of the C-node processors. 

The M-cluster serves as a fault-tolerant controller of 
recovery management for the C- and D-nodes in the Outer Ring. 
The M-cluster employs hybrid redundancy (triplication and 
voting, with unpowered spares) to assure its own continuous 
availability. It is an evolved descendant of the 
Test-and-Repair processor of the JPL-STAR computer [9]. Two 
dedicated A-nodes are connected to every C-node, and every 
D-node contains two A-ports. The A-nodes and A-ports serve as 
the input and output devices of the M-cluster: they relay 
error signals and other relevant outputs of the C- and D-nodes 
to the M-cluster and return M-cluster responses to the 
appropriate C- or D-node inputs. 

The custom-designed Inner Ring and the D-nodes provide an 
FTI that assures dependable operation of the client COTS 
computing system composed of the C-nodes. The infrastructure is 
generic; that is, it can accommodate any client system (set of 
Outer Ring C-node chips) by providing them with the A-nodes 
and storing the proper responses to A-node error messages in 
the M-nodes. Fault-tolerance techniques are extensively used 
in the design of the infrastructure's components. 

The following discussion explains the functions and 
structure of the Inner Ring elements (Fig. 2), particularly 
the A- and M-nodes, the operation of the M-cluster, and the 
communication between the M-cluster and the A-nodes. Unless 
explicitly stated otherwise, the A-ports are structured and 
behave like the A-nodes. The D-nodes are discussed in Section 
3 below. 

(b) The Adapter Nodes (A-Nodes) and A-lines. The 
purpose of an A-node (Fig. 4a) is to connect a particular 
C-node to the M-cluster that provides Outer Ring recovery 
management for the client COTS system. The functions of an 
A-node are to: 

1. transmit error messages that are originated by its C-node 
to the M-cluster; 

2. transmit recovery commands from the M-cluster to its 
C-node; 

3. control the power switch of the C-node and its own fuse 
according to commands received from the M-cluster; and 

4. report its own status to the M-cluster. 

Every C-node is connected to an A-pair that is composed 
of two A-nodes, three CS units CS1, CS2, CS3 (Fig. 4b), one OR 
Power Switch 415 that provides power to the C-node and one 
Power Fuse 416 common to both A-nodes and the CS units. The 
internal structure of a CS unit is shown in Fig. 3. The two 
A-nodes (Fig. 4a) of the A-pair have, in common, a unique 
identification or "ID" code 403 that is associated with their 
C-node; otherwise, all A-nodes are identical in their design. 
They encode the error signal outputs 431 of their C-node and 
decode the recovery commands 407 to serve as inputs 441a to 
the comparator CS1 that provides command inputs to the C-node. 

As an example, consider the Pentium II processor as a 
C-node. It has five error signal output pins: AERR (address 
parity error), BINIT (bus protocol violation), BERR (bus 
non-protocol error), IERR (internal non-bus error), and 
THERMTRIP (thermal overrun error) which leads to processor 
shutdown. It is the function of the A-pair to communicate 
these signals to the M-cluster. The Pentium II also has six 
recovery command input pins: RESET, INIT (initialize), BINIT 
(bus initialize), FLUSH (cache flush), SMI (system management 
interrupt), and NMI (non-maskable interrupt). The A-pair can 
activate these inputs according to the commands received from 
the M-cluster. 

Each A-node has a separate A-line 444a, 444b for messages 
to the M-cluster. The messages are: 



(1) All is well, C-node powered, 

(2) All is well, C-node unpowered, 

(3) M-bus request, 

(4) Transmitting on M-bus, and 

(5) Internal A-node fault. 
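
The signals and messages listed above can be pictured as three 
small code tables plus an encode and a decode step. The sketch 
below is illustrative only: the pin and message names come from 
the text, but the numeric codes, the (ID, code) tuple layout, 
and the helper names `encode_error` and `decode_command` are 
assumptions, not the encoding used in the A-node hardware. 

```python
from enum import Enum
from typing import Optional, Tuple


class ErrorSignal(Enum):
    """Pentium II error outputs named in the text (codes are hypothetical)."""
    AERR = 1       # address parity error
    BINIT = 2      # bus protocol violation
    BERR = 3       # bus non-protocol error
    IERR = 4       # internal non-bus error
    THERMTRIP = 5  # thermal overrun, leads to processor shutdown


class RecoveryCommand(Enum):
    """Pentium II recovery inputs named in the text (codes are hypothetical)."""
    RESET = 1
    INIT = 2
    BINIT = 3
    FLUSH = 4
    SMI = 5
    NMI = 6


class ALineMessage(Enum):
    """The five A-line status messages (codes are hypothetical)."""
    ALL_WELL_POWERED = 1
    ALL_WELL_UNPOWERED = 2
    MBUS_REQUEST = 3
    TRANSMITTING = 4
    INTERNAL_FAULT = 5


def encode_error(c_node_id: int, signal: ErrorSignal) -> Tuple[int, int]:
    """Build an (ID, error code) message for the M-bus."""
    return (c_node_id, signal.value)


def decode_command(c_node_id: int,
                   message: Tuple[int, int]) -> Optional[RecoveryCommand]:
    """Return the input to activate, or None if addressed to another C-node."""
    target_id, code = message
    if target_id != c_node_id:
        return None
    return RecoveryCommand(code)
```

For instance, `encode_error(7, ErrorSignal.IERR)` yields the 
pair `(7, 4)` under these assumed codes, and an A-pair with ID 
7 ignores a command addressed to ID 8. 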



All A-pairs of the Inner Ring are connected to the M-bus, 
which provides two-way communication with the M-cluster as 
discussed in the next subsection. 

The outputs 441a, 441b (Fig. 4b) of the A-pair to the 
C-node, outputs 442a, 442b to the C-node power switch and 
outputs 445a, 445b to the M-bus are compared in Comparator 
circuits CS1, CS2, CS3. In case of disagreement, the outputs 
441, 442, 445 are inhibited (assume the high-impedance third 
state Z) and an "Internal fault" message is sent on the two 
A-lines 444a, 444b (Fig. 4a). The single exception is the 
C-node Power-Off command. One Power-Off command is sufficient 
to turn C-node power 446 (Fig. 4b) off after the failure of 
one A-node in the pair. 

The A-pair remains powered by Inner Ring power 426 when 
Outer Ring power 446 to its C-node is off, i.e., when the 
C-node is a spare or has failed. The failure of one A-node in 
the self-checking A-pair turns off the power of its C-node. A 
fuse 416 is used to remove power from a failed A-pair, thus 
protecting the M-bus against "babbling" outputs from the 
failed A-pair. Clock synchronization signals 425a (Fig. 4a) 
are delivered from the M-cluster. The low complexity of the 
A-node allows the packaging of the A-pair and power switch as 
one IC device. 
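
The comparator behavior of the self-checking A-pair, including 
the Power-Off exception, can be sketched as follows. This is a 
simplified model under stated assumptions: the command strings 
are hypothetical, `None` stands in for the high-impedance state 
Z, and the A-line status is reduced to two values. 

```python
from typing import Optional, Tuple

POWER_OFF = "POWER_OFF"            # hypothetical command name
INTERNAL_FAULT = "INTERNAL_FAULT"  # A-line status on disagreement
ALL_IS_WELL = "ALL_IS_WELL"        # A-line status on agreement
HIGH_Z = None                      # stands in for the third state Z


def compare_outputs(out_a: str, out_b: str) -> Tuple[Optional[str], str]:
    """Model one CS comparator: returns (driven output, A-line status)."""
    if out_a == out_b:
        return out_a, ALL_IS_WELL
    # Disagreement: inhibit the output and report an internal fault.
    return HIGH_Z, INTERNAL_FAULT


def power_switch_output(cmd_a: str, cmd_b: str) -> Tuple[Optional[str], str]:
    """Power-switch path: a single Power-Off is enough to cut C-node power."""
    if POWER_OFF in (cmd_a, cmd_b):
        status = ALL_IS_WELL if cmd_a == cmd_b else INTERNAL_FAULT
        return POWER_OFF, status
    return compare_outputs(cmd_a, cmd_b)
```

The asymmetry is deliberate in this reading of the text: an 
erroneous power-off only removes a C-node that a spare can 
replace, whereas a failed A-pair that cannot power off its 
C-node would leave it running unsupervised. 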

(c) The Monitor (M-) Nodes, M-Cluster and M-Bus. The 
purpose of the Monitor Node (M-node, Fig. 5) is to collect 
status and error messages from one or more (and in the 
aggregate all) A-nodes, to select the appropriate recovery 
action, and to issue recovery-implementing commands to the 
A-node or nodes via the Monitor Bus (M-Bus). To assure 
continuous availability, the M-nodes are arranged in a hybrid 
redundant M-cluster, with three powered M-nodes in a 
triplication-and-voting mode (often called "triple modular 
redundancy", or TMR) and also with unpowered spare M-nodes. 
The voting on output commands takes place in Voter logic 410 
(Fig. 4a) located in the A-nodes. A built-in self-test (BIST) 
sequence 408 is provided in every M-node. 

The M-bus is controlled by the M-cluster and connected to 
all A-nodes, as discussed in the previous section. All messa- 
ges are error-coded, and spare bus lines are provided to make 
the M-bus fault-tolerant. Two kinds of messages are sent to 
the A-pairs by the M-cluster: (1) an acknowledgment of A-pair 
request (on their A-lines 444a, 444b) that allocates a time 
slot on the M-bus for the A-pair error message; and (2) a com- 
mand in response to the error message. 

An M-node stores two kinds of information: static 
(permanent) and dynamic. The static (ROM) data 505 (Fig. 5) 
consist of: 

(1) predetermined recovery command responses to A-pair error 
messages, 

(2) sequences for M-node recovery and replacement in the 
hybrid-redundant M-cluster, and 

(3) recovery sequences for catastrophic events, discussed 
in subsection 2(f). 

The dynamic data consist of: 

(1) Outer Ring configuration status 504 (active, spare, 
failed node list), 

(2) Inner Ring configuration status 503 and system time 502, 

(3) a "scratchpad" store 501, 506, 507, 509, 510 for current 
activity: error messages still active, requests waiting, 
etc., and 

(4) an Inner Ring activity log (also in 506). 

The configuration status and system time are the critical data 
that are also stored in nonvolatile storage in the S3-nodes of 
the Cluster Core, discussed in subsection 2(d). 

As long as all A-nodes continue sending "All is well" 
messages on their A-lines (525 through 528 and so on), the 
M-cluster issues 541 "All is well" acknowledgments. When an 
"M-bus request" message arrives on two A-lines that come from 
a single A-pair that has a unique C-node ID code, the 
M-cluster sends 541 (on the M-bus) the C-node ID followed by 
the "Transmit" command. In response, the A-pair sends 522, on 
the M-bus, its C-node ID followed by an Error code originated 
by the C-node. The M-nodes return 541 the C-node ID followed 
by a Recovery command for the C-node. The A-pair transmits the 
command to the C-node and returns 522 an acknowledgment: its 
C-node ID followed by the command it forwarded to the C-node. 
At the times when an A-pair sends a message on the M-bus, its 
A-lines send the "Transmitting" status report. This feature 
allows the cluster to detect cases in which a wrong A-pair 
responds on the M-bus. The A-pair also sends an Error message 
on that bus if its voters detect disagreements between the 
three M-cluster messages received on the M-bus. 
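
The exchange described above can be written out as an ordered 
message trace. The sketch below only restates the order given 
in the text; the tuple layout, the function names, and the 
placement of the "Transmitting" report immediately before each 
A-pair bus message (in the text the two are concurrent) are 
assumptions made for illustration. 

```python
from typing import List, Tuple

Message = Tuple[str, str, tuple]  # (channel, sender, payload)


def error_report_exchange(c_node_id: int, error_code: str,
                          recovery_command: str) -> List[Message]:
    """List the messages of one error report, in the order given in the text."""
    return [
        ("A-line", "A-pair", ("M-bus request",)),
        ("M-bus", "M-cluster", (c_node_id, "Transmit")),
        ("A-line", "A-pair", ("Transmitting",)),
        ("M-bus", "A-pair", (c_node_id, error_code)),
        ("M-bus", "M-cluster", (c_node_id, recovery_command)),
        ("A-line", "A-pair", ("Transmitting",)),
        ("M-bus", "A-pair", (c_node_id, recovery_command)),  # acknowledgment
    ]


def wrong_responder(expected_id: int, transmitting_ids: List[int]) -> bool:
    """True if any A-pair other than the addressed one reports 'Transmitting'."""
    return any(node_id != expected_id for node_id in transmitting_ids)
```

Because every A-pair has its own A-lines, the M-cluster can 
apply a check like `wrong_responder` without relying on the 
content of the M-bus message itself. 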

When the A-pair comparators CS1, CS2, CS3 (Fig. 4b) 
detect a disagreement, the A-lines send an "Internal Fault" 
message to the M-cluster, which responds (on the M-bus) with 
the C-node ID followed by the "Reset A-pair" command. Both of 
the A-nodes of the A-pair attempt to reset to an initial state, 
but do not change the setting of the C-node power switch. 
Success causes "All is well" to be sent on the A-lines to the 
M-cluster. In case of failure to reset, the A-lines continue 
sending the "Internal Fault" message. 

The M-cluster sends "Power On" and "Power Off" commands 
522 (Fig. 5) as part of a replacement or reconfiguration 
sequence for the C-nodes. They are acknowledged immediately, 
but power switching itself takes a relatively long time. When 
switching is completed, the A-pair issues an "M-bus Request" 
on its A-lines and then reports 522 the success (or failure) 
of the switching to the M-cluster via the M-bus. 

When the M-cluster determines that one A-node of an 
A-pair has permanently failed, it sends an "A-pair Power Off" 
message 541 to that A-pair. The good A-node receives the 
message, turns C-node power 446 (Fig. 4b) off if it was on, 
and then permanently opens (by 443a or 443b) the A-pair power 
fuse 416. The M-cluster receives confirmation via the A-lines 
444a, 444b (Fig. 4a), which assume the "no power" state. This 
irreversible command is also used when a C-node fails 
permanently and must be removed from the Outer Ring. 

(d) The M-Cluster Core. The Core (Fig. 6) of the 
earlier-introduced M-cluster (Fig. 2) includes a set of 
S3-nodes (Fig. 7) and communication links. As mentioned 
earlier, "S3" stands for Startup, Shutdown, Survival. The 
M-nodes (Fig. 5) have dedicated "Disagree" 545, "Internal 
Error" 544 and "Replacement Request" 543 outputs to all other 
M-nodes and to the S3-nodes. The IntraCluster-Bus or IC-Bus 
602 (Fig. 6) interconnects all M-nodes. 

The purpose of the S3-nodes is to support the survival of 
DiSTARS during catastrophic events, such as intensive bursts 
of radiation or temporary loss of power. Every S3-node is a 
self-checking pair with its own backup (battery) power 707 
(Fig. 7). At least two S3-nodes are needed to attain fault 
tolerance, and the actual number needed depends on the mission 
length without external repair. 

The functions of the S3-nodes are to: 

(1) execute the "power-on" and "power-off" sequences (Fig. 8) 
for DiSTARS; 

(2) provide fault-tolerant clock signals 720 (Fig. 7); 

(3) keep System Time 702a and System Configuration 704a, 705a 
data in nonvolatile, radiation-hardened registers; and 

(4) control M-node power switches 511 (Fig. 5), and I-Ring 
power 450 (Fig. 4b) to the A-pairs, in order to support 
M-cluster recovery. 

More details of S3-node operation follow in subsection 2(f). 

Each self-checking S3-node has its own clock generator 
701 (Fig. 7). The hardware-based fault-tolerant clocking 
system developed at the C. S. Draper Laboratory [10] is the 
most suitable for the M-cluster. 

(e) Error Detection and Recovery in the M-cluster. At 
the outset, the three powered M-nodes 201a, 201b, 201c (Fig. 
2) are in agreement and contain the same dynamic data. They 
operate in the triple modular redundancy (TMR) mode. Three 
commands are issued in sequence on the M-bus 202 and voted 
upon in the A-nodes 410 (Fig. 4a). During operation of the 
M-cluster, one M-node may issue an output different from the 
other two, or one M-node may detect an error internally and 
send an "Internal Error" signal on a dedicated line 544 (Fig. 
5) to the other M-nodes. The cause may be either a "soft" 
error due to a transient fault, or a "hard" error due to 
physical failure. 

M-node output disagreement detection in the TMR mode 
(when one M-node is affected by a fault) works as follows. 
The three M-nodes 201a, 201b, 201c (Fig. 2) place their 
outputs on the M-bus 202 in a fixed sequence. Each M-node 
compares its output to the outputs of the other two nodes, 
records one or two disagreements, and sends one or two 
"Disagree" messages to the other M-nodes on a dedicated line 
545 (Fig. 5). The affected M-node will disagree twice, while 
the good M-nodes will disagree once each and at the same time, 
which is the time slot of the affected M-node. 
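
The counting rule above (the affected node disagrees twice, 
each good node once) can be checked with a short sketch. It 
assumes exactly three outputs that can be compared for 
equality; the function names and the use of an exception for 
the all-disagree case are illustrative choices, not part of 
the described hardware. 

```python
from typing import List, Optional


def disagree_counts(outputs: List[str]) -> List[int]:
    """For each of the three M-nodes, count the peers whose output differs."""
    if len(outputs) != 3:
        raise ValueError("TMR mode expects exactly three outputs")
    return [sum(1 for j, other in enumerate(outputs) if j != i and other != mine)
            for i, mine in enumerate(outputs)]


def affected_node(outputs: List[str]) -> Optional[int]:
    """Return the index of the single affected M-node, or None if all agree.

    Raises ValueError when all three disagree, which the text treats as a
    catastrophic event handled by a separate procedure.
    """
    counts = disagree_counts(outputs)
    if counts == [0, 0, 0]:
        return None
    if sorted(counts) == [1, 1, 2]:
        return counts.index(2)
    raise ValueError("all three M-nodes disagree: catastrophic event")
```

With three outputs only three patterns are possible: no 
disagreements, the (1, 1, 2) pattern that singles out one 
node, or (2, 2, 2) when all three outputs differ. 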

After error detection, the two good M-nodes carry out the 
following recovery sequence. 

(1) Identify the affected M-node or the M-node that sent the 
Internal Error message, and enter the Duplex Mode of the 
M-cluster. 

(2) Attempt "soft" error recovery by reloading the dynamic 
data of the affected M-node from the other two M-nodes 
and resume TMR operation. 

(3) If Step (2) does not lead to agreement, send request for 
replacement 543 (Fig. 5) of the affected M-node to the 
S3-nodes. 

(4) The S3-nodes replace the affected M-node and send "Resume 
TMR" command 726 (Fig. 7). 

(5) Load the new M-node with dynamic data from the other two 
M-nodes and resume TMR operation. 

During the recovery sequence, the two good (agreeing) 
M-nodes 601a, 601b (Fig. 6) operate in the Duplex Mode, in 
which they continue to communicate with the A-nodes and 
concurrently execute the recovery steps (2) through (5). The 
Duplex Mode becomes the permanent mode of operation if only 
two good M-nodes are left in the M-cluster. Details of the 
foregoing M-cluster recovery sequence are discussed next. 

Step (1): Entering Duplex Mode. The simultaneous 
disagreement 527 (Fig. 5) by the good M-nodes 601a, 601b (Fig. 
6) during error detection causes the affected M-node 601c to 
enter the "Hold" mode, in which it inhibits its output 541 
(Fig. 5) to the M-bus and does not respond to inputs on the 
A-lines. It also clears its "Disagree" output 645. If the 
affected node 601c (Fig. 6) does not enter the "Hold" mode, 
step (3) is executed to cause its replacement. An M-node 
similarly enters the "Hold" mode when it issues an Internal 
Error message 544 (Fig. 5) to the other two M-nodes, which 
enter the Duplex Mode at that time. It may occur that all 
three M-nodes disagree, i.e., each one issues two "Disagree" 
signals 545, or that two or all three M-nodes signal Internal 
Error 544. These catastrophic events are discussed in 
subsection 2(f). 

The two good M-nodes 601a, 601b (Fig. 6) still send three 
commands to the A-nodes in Duplex Mode during steps (2)-(5). 
During t1 and t2 they send their outputs to the M-bus and 
compare. An agreement causes the same command to be sent 
during t3; disagreement invokes a retry, then catastrophic 
event recovery. The good M-nodes continue operating in Duplex 
Mode if a spare M-node is not available after the affected 
node has been powered off in step (3). In that case TMR 
operation is permanently degraded to Duplex in the M-cluster. 
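
The three-slot transmission in Duplex Mode can be sketched as 
below. The two M-nodes are modeled as callables that produce a 
command; the single retry and the `CatastrophicEvent` exception 
are assumptions standing in for "a retry, then catastrophic 
event recovery". 

```python
from typing import Callable, Tuple


class CatastrophicEvent(Exception):
    """Raised when the two remaining M-nodes cannot agree."""


def duplex_send(node_a: Callable[[], str], node_b: Callable[[], str],
                retries: int = 1) -> Tuple[str, str, str]:
    """Return the commands placed on the M-bus at t1, t2 and t3."""
    for _ in range(retries + 1):
        out_t1, out_t2 = node_a(), node_b()
        if out_t1 == out_t2:
            # Agreement: the same command is repeated in the third slot,
            # so the A-node voters still receive three copies.
            return out_t1, out_t2, out_t1
    raise CatastrophicEvent("duplex M-nodes disagree after retry")
```

Filling the third slot keeps the A-node voter logic unchanged 
between TMR and Duplex operation, which appears to be the 
reason the text retains three commands in both modes. 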

Step (2): Reload Dynamic Data of the Affected M-node 
(assuming M-node 601c [Fig. 6] is affected). An IntraCluster 
Bus or IC-Bus 602 is used for this purpose. At times t1 and t2 
the good M-nodes 601a, 601b place the corresponding dynamic 
data on the IC-Bus 602; at time t3 the affected node 601c 
compares and stores it. The good nodes also compare their 
outputs. Any disagreement causes a repetition of times t1, t2, 
t3. A further disagreement between good nodes is a 
catastrophic event. After reloading is completed, it is 
validated: the affected node reads out its data, and the good 
nodes compare it to their copies. A disagreement leads to step 
(3), i.e., power-off for the affected node; otherwise the 
M-cluster returns to TMR operation. 

Steps (3) and (4): Power Switching. Power switching 511 
(Fig. 5) is a mechanism for removing failed M-nodes and 
bringing in spares in the M-cluster. Failed nodes with power 
on can lethally interfere with M-cluster functioning; 
therefore very dependable switching is essential. The 
power-switching function 730 (Fig. 7) is performed by the 
S3-nodes in the Cluster Core. They maintain a record of 
M-cluster status in nonvolatile storage 705a. Power is turned 
off for the failed M-node, the next spare is powered up, BIST 
is executed, and the "Resume TMR" command 530 (Fig. 5) is sent 
to the M-nodes. 

Step (5): Loading a New M-node. When the "Resume TMR" 
command of step (4) is received, the new M-node must receive 
the dynamic data from the two good M-nodes. The procedure is 
the same as step (2). 
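
Taken together, steps (1) through (5) form a small decision 
tree. The sketch below traces the M-cluster modes visited for 
each outcome; the mode names and the two boolean inputs are 
hypothetical labels chosen for illustration, and the case of a 
spare that itself fails validation is left out. 

```python
from typing import List


def recover(soft_reload_ok: bool, spare_available: bool) -> List[str]:
    """Trace the modes visited while recovering from one bad M-node."""
    trace = ["Duplex"]                     # step (1): isolate the affected node
    if soft_reload_ok:                     # step (2): reload and validate data
        return trace + ["TMR"]
    trace.append("ReplacementRequested")   # step (3): ask the S3-nodes
    if not spare_available:
        return trace + ["DuplexPermanent"]
    # Step (4): power switching, BIST, "Resume TMR"; step (5): load new node.
    return trace + ["SparePoweredAndTested", "SpareLoaded", "TMR"]
```

A soft error therefore costs one pass through Duplex Mode, 
while a hard failure either consumes a spare or leaves the 
cluster permanently in Duplex Mode. 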



(f) Recovery after Catastrophic Events. Up to this 
point recovery has been defined in response to an error signal 
from one C-node, A-node, or M-node for which the M-cluster had 
a predetermined recovery command or sequence. These 
recoveries are classified as local and involve only one node. 

It is possible, however, for error signals to originate 
from two or more nodes concurrently (or close in time). A few 
such cases have been identified as "catastrophic" events 
(c-events) in the preceding discussion. It is not practical 
to predetermine unique recovery for each c-event; therefore, 
more general catastrophe-recovery (c-recovery) procedures must 
be devised. 

In general, I can distinguish c-events that affect the 
Outer Ring only, and events that affect the Inner Ring as 
well. For the Outer Ring a c-event is a crash of system 
software that requires a restart with Inner Ring assistance. 
The Inner Ring does not employ software; thus, assuming 
well-proven ASIC programming, its crash cannot occur in the 
absence of hardware failure (F1), (F2). 

There are, however, adverse physical events of the (F1) 
and (F2) types that can cause c-events for the entire DiSTARS. 
Examples are: (1) external interference by radiation; (2) 
fluctuations of ambient temperature; (3) temporary instability 
or outage of power; (4) physical damage to system hardware. 

The predictable manifestations of these events in DiSTARS 
are: (1) halt in operation due to power loss; (2) permanent 
failures of system components (nodes) and/or communication 
links; (3) crashes of Outer Ring application and system 
software; (4) errors in or loss of M-node data stored in 
volatile storage; (5) numerous error messages from the A-nodes 
that exceed the ability of the M-cluster to respond in time; 
(6) double or triple disagreements or Internal Error signals 
in the M-cluster TMR or Duplex Modes. 

The DiSTARS embodiments now most highly preferred employ 
a System Reset procedure in which the S3-nodes execute a 
"power-off" sequence (Fig. 8c) for DiSTARS on receiving a 
c-event signal either from sensors (radiation level, power 
stability, etc.) or from the M-nodes. System Time 702a (Fig. 
7) and DiSTARS configuration data 704a, 705a are preserved in 
the radiation-hardened, battery-powered S3-nodes. The 
"power-on" sequence (Figs. 8a, 8b) is executed when the 
sensors indicate a return to normal conditions. 

Outer Ring power is turned off when the S3-node sends the 
signal 729 (Fig. 7) to remove power from the A-pairs, thus 
setting all C-node switches to the "Off" position. M-node 
power is directly controlled by the S3-node output 730. 

The "power-on" sequence for M-nodes (Fig. 8a) begins with 
the S3-nodes applying power and executing BIST to find three 
or two good M-nodes, loading them via the IC-Bus with critical 
data, then applying I-Ring power to the A-pairs. The sequence 
continues with sending the "Outer Ring Power On" command 727 
(Fig. 7) to the M-cluster. 

To start the "power on" sequence for C- and D-nodes (Fig. 
8b) the M-cluster commands (on the M-bus) "Power-On" followed 
by BIST sequentially for the C-nodes and D-nodes of the Outer 
Ring, and the system returns to an operating condition, 
possibly having lost some nodes due to the catastrophic event. 
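
The two-stage power-on sequence can be outlined in code. This 
is a sketch under simplifying assumptions: BIST outcomes are 
supplied as booleans, all M-nodes are tested at once rather 
than being powered one spare at a time, and raising an error 
when fewer than two M-nodes pass is an assumption, since the 
text does not say what happens in that case. 

```python
from typing import Dict, List


def power_on(m_node_bist: Dict[str, bool],
             outer_node_bist: Dict[str, bool]) -> Dict[str, List[str]]:
    """Return which nodes end up active after the power-on sequence."""
    # S3-nodes: power the M-nodes and keep the first three (or two) that
    # pass BIST.
    good_m = [name for name, passed in m_node_bist.items() if passed][:3]
    if len(good_m) < 2:
        raise RuntimeError("fewer than two good M-nodes: M-cluster cannot start")
    # Critical data load over the IC-Bus and I-Ring power-up would occur
    # here, followed by the "Outer Ring Power On" command to the M-cluster.
    # M-cluster: Power-On followed by BIST, one Outer Ring node at a time.
    active_outer = [name for name, passed in outer_node_bist.items() if passed]
    return {"m_nodes": good_m, "outer_nodes": active_outer}
```

Nodes that fail BIST are simply absent from the result, which 
mirrors the remark that the system may return to operation 
having lost some nodes. 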

Currently preferred embodiments are equipped with only 
the "power-off" sequence to respond to c-events. The 
invention, however, contemplates introducing less drastic and 
faster recovery sequences for some less harmful c-events. 
Experiments in progress with the prototype DiSTARS system 
address development of such sequences. 



3. THE DECISION (D-) NODES AND DIVERSIFICATION 



(a) The Rationale for D-Nodes. The A-nodes in the 
discussion thus far have been the only means of communication 
between the Inner and Outer Rings, and they convey only very 
specific C-node information. A more general communication 
link is needed. The Outer Ring may need configuration data 
and activity logs from the M-cluster, or to command the 
powering up or down of some C-nodes for power management 
reasons. An Inter-Ring communication node beneficially acts as 
a link between the System Bus of the Outer Ring and the M-bus 
of the Inner Ring. 

A second need of the Outer Ring is enhanced error 
detection coverage. For example, as described in subsection 
2(b), the Pentium II has only five error-signal outputs of very 
general nature, and in a recent study [6, 7] their coverage 
was estimated to be very limited. The original design of the 
P6 family of Intel processors included the FRC (functional 
redundancy checking) mode of operation in which two processors 
could be operated in the Master/Checker mode, providing very 
good error confinement and high error detection coverage. 
Detection of an error was indicated by the FRCERR signal. Quite 
surprisingly and without explanation, the FRCERR pin was 
removed from the specification in April 1998, thus effectively 
canceling the use of the FRC mode long after the P6 processors 
reached the market. 

In fairness it should be noted that other processor 
makers have never even tried to provide Master/Checker 
duplexing for their high-performance processors with low error 
detection coverage. An exception is the design of the IBM G5 
and G6 processors [7]. 

This observation explains the inclusion of a custom 
Decision Node (D-node) on the Outer Ring System Bus that can 
serve as an external comparator or voter for the C-node COTS 
processors. It is even more important that the D-node also be 
able to support design diversity by providing the appropriate 
decision algorithms for N-version programming [4] employing 
diverse processors as the C-nodes of the Outer Ring. 

The use of processor diversity has become important for 
dependable computing because contemporary high-performance 
processors contain significant numbers of design faults. For 
example, a recent study shows that in the Intel P6 family 
processors from forty-five to 101 design faults ("errata") 
were discovered (as of April 1999) after design was complete, 
and that from thirty to sixty of these design faults remain in 
the latest versions ("steppings") of these processors [2]. 

(b) Decision Node (D-Node) Structure and Functions. 
The D-nodes (Fig. 9) need to be compatible with the C-nodes on 
the System Bus and also embed Adapter (A-) Ports analogous to 
the A-nodes that are attached to C-nodes. The functions of 
the D-nodes are: 

(1) to transmit messages originated by C-node software to the 
M-cluster; 

(2) to transfer M-cluster data to the C-nodes that request 
it; 

(3) to accept C-node outputs for comparison or voting and to 
return the results to the C-nodes; 

(4) to provide a set of decision algorithms for N-version 
software executing on diverse processors (C-nodes), to 
accept cross-check point outputs and return the results; 

(5) to log disagreement data on the decisions; and 

(6) to provide high coverage and fault tolerance for the 
execution of the above functions. 
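
Functions (3) through (5) amount to a comparison or majority 
vote with a record of dissenting channels. The sketch below 
shows one such decision algorithm; exact-match voting, the 
dictionary keyed by C-node name, and the function name 
`decide` are assumptions, and the text does not specify which 
decision algorithms the D-node actually provides. 

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Tuple


def decide(outputs: Dict[str, Hashable]) -> Tuple[Optional[Hashable], List[str]]:
    """Compare (2 channels) or vote (3+ channels) on cross-check outputs.

    `outputs` must be non-empty. Returns (agreed value or None, names of
    the dissenting C-nodes, to be logged as disagreement data).
    """
    tally = Counter(outputs.values())
    value, votes = tally.most_common(1)[0]
    if votes * 2 <= len(outputs):
        # No strict majority: with two channels this is a plain mismatch.
        return None, sorted(outputs)
    dissenters = sorted(name for name, out in outputs.items() if out != value)
    return value, dissenters
```

With two channels the function acts as a comparator, and with 
three it acts as a TMR voter that also identifies the channel 
to report to the M-cluster. 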

Ideally the programs of the C-nodes are written with 
provisions to take advantage of D-node services. The relatively 
simple functions of the D-node can be implemented by microcode 
and the D-node response can be very fast. Another advantage 
of using the D-node for decisions (as opposed to doing them in 
the C-nodes) is the high coverage and fault tolerance of the 
D-node (implemented as a self-checking pair) that assures 
error-free results. 

The Adapter Ports (A-Ports) of the D-node need to provide 
the same services that the A-nodes provide to the C-nodes, 
including power switching for spare D-node utilization. In 
addition, the A-ports must also serve to relay appropriately 
formatted C-node messages to the M-cluster, then accept and 
vote on M-cluster responses. The messages are requests for 
C-node power switching, Inner and Outer Ring configuration 
information, and M-cluster activity logs. The D-node can 
periodically request and store the activity logs, thus 
reducing the amount of dynamic storage in the M-nodes. The 
D-nodes can also serve as the repositories of other data that 
may support M-cluster operations, such as the logs of 
disagreements during D-node decisions, etc. 

The relatively simple D-nodes can effectively compensate 
for the low coverage and poor error containment of contempo- 
rary processors ( e. a. Pentium II) by allowing their duplex or 
TMR operation with reliable comparisons or voting and with 
diverse processors executing N-version software for the tol- 
erance of software and hardware design faults. 

4. A PROOF-OF-CONCEPT EXPERIMENTAL SYSTEM

The Two Ring configuration, with the Inner Ring and the 
D-nodes providing the fault-tolerance infrastructure for the 
Outer Ring of C-nodes that is a high-performance "client" COTS 
computer, is well defined and complete. 

Many design choices and tradeoffs, however, remain to be 
evaluated and chosen. A prototype DiSTARS system for experi- 
mental evaluation uses a four-processor symmetric multiproces- 
sor configuration [11] of Pentium II processors with the sup- 
porting chipset as the Outer Ring. The Pentium II processors 
serve as C-nodes. The S3-nodes, M-nodes, D-nodes, A-nodes and 
A-ports are being implemented by Field-Programmable Gate
Arrays (FPGAs).

This development includes construction of power switches
and programming of typical applications running on duplex
C-nodes that use the D-node for comparisons; and diversifica- 
tion of C-nodes and N-version execution of typical applica- 
tions. Building and refining the Inner Ring that can support 
the Pentium II C-nodes of the Outer Ring provides a proof of 
the "fault-tolerance infrastructure" concept. 






5. EXTENSIONS AND APPLICATIONS 



The Inner Ring and D-nodes of DiSTARS offer what may be
called a "plug-in" fault-tolerance infrastructure for the
client system that uses contemporary high-performance but
low-coverage COTS processors with their memories and supporting
chipsets. The infrastructure is in effect an analog of the
human immune system [12] in the context of contemporary hard-
ware platforms [8]. DiSTARS is an illustration of the appli-
cation of the design paradigm presented in [12].

A desirable advance in processor design is to incorporate 
an evolved variant of the infrastructure into the processor 
structure itself. This is becoming feasible as the clock rate 
and transistor count on chips race upward according to Moore's 
Law. The external infrastructure concept, however, remains 
viable and necessary to support chip-level sparing, power 
switching, and design diversity for hardware, software, and 
device technologies.

The high reliability and availability that may be at-
tained by using the infrastructure concept in system design is
likely to be affordable for most computer systems. There ex-
ist, however, challenging missions that can only be justified
if their computers have high coverage with respect to tran-
sient and design faults as well as low device failure rates.

Two such missions that are still in the concept and 
preliminary design phases are the manned mission to Mars [13] 
and unmanned interstellar missions [14].

The Mars mission is about 1000 days long. The proper 
functioning of the spacecraft and therefore the lives of the 
astronauts depend on the continuous availability of computer 
support, analogous to primary flight control computers in com- 
mercial airliners. Device failures and wear-out are not major 
threats for a 1000 day mission, but design faults and tran- 
sient faults due to cosmic rays and solar flares are to be ex- 
pected and their effects need to be tolerated with very high 
coverage, i.e., probability of success. It will also be nec-
essary to employ computers to monitor all spacecraft systems
and perform automatic repair actions when needed [9, 15], as 
the crew is not likely to have the necessary expertise and 
access for manual repairs. Here again computer failure can 
have lethal consequences and very high reliability is needed. 

Another challenging application for a DiSTARS type fault-
tolerant computer is on-board operation in an unmanned space-
craft intended for an interstellar mission. Since such mis-
sions are essentially open-ended, lifetimes of hundreds or
even thousands of years are desirable. For example, currently
the two Voyager spacecraft (launched in 1977) are in inter-
stellar space, traveling at 3.5 and 3.1 A.U. (astronomical
units) per year. One A.U. is 150 × 10^6 kilometers, while the
nearest star Alpha Centauri is 4.3 light years, or approxi-
mately 272,000 A.U. (about 63,000 A.U. per light year), from
the sun. Near-interstellar space, however, is being explored,
and research in breakthrough propulsion physics is being
conducted by NASA [14].

An interesting concept is to create a fault-tolerant
relay chain of modest-cost DiSTARS type fault-tolerant space-
craft for the exploration of interstellar space. One space-
craft is launched on the same trajectory every n years, where
n is chosen such that the distance between two successive
spacecraft allows reliable communication with the two closest
neighbors ahead of and behind a given spacecraft (Fig. 10). The
loss of any one spacecraft does not interrupt the link between
the leading spacecraft and Earth, and the chain can be re-
paired by slowing down all spacecraft ahead of the failed one
until the gap is closed.
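
The spacing condition for the relay chain can be illustrated with a short calculation. This is a sketch under stated assumptions: all spacecraft are assumed to travel at the same constant speed, every link (including Earth's) is assumed to have the same maximum range, and the function names and the 70 A.U. range used in the example are hypothetical. If successive spacecraft are n × v apart, reaching the second-closest neighbor requires 2 × n × v to be no greater than the communication range.

```python
def max_launch_interval(comm_range_au, speed_au_per_year):
    """Largest launch interval n (years) for which the second-closest
    neighbor is still in range: 2 * n * speed <= comm_range."""
    return comm_range_au / (2.0 * speed_au_per_year)


def link_intact(positions_au, comm_range_au):
    """True if the farthest surviving spacecraft can still reach Earth
    (at 0 A.U.) by hopping between surviving spacecraft."""
    reach = 0.0  # position of the farthest node already linked to Earth
    for pos in sorted(positions_au):
        if pos - reach > comm_range_au:
            return False  # gap too wide: the chain is broken here
        reach = pos
    return True


# Example with an assumed 70 A.U. range and a speed of 3.5 A.U. per year.
n = max_launch_interval(70.0, 3.5)             # launch interval in years
spacing = n * 3.5                              # distance between neighbors
chain = [spacing * k for k in range(1, 7)]     # six spacecraft in flight
single_losses_ok = all(
    link_intact([p for p in chain if p != lost], 70.0) for lost in chain
)
```

Under these assumptions any single loss leaves the link intact, while the loss of two adjacent spacecraft opens a gap of three spacings and breaks it until the chain is repaired.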

Additional information appears in A. Avizienis, "The hun-
dred year spacecraft", in Proc. of the 1st NASA/DoD Workshop
on Evolvable Hardware, pages 233-39 (July 1999).

6. KEY TO THE DRAWINGS 

(a) Figs. 1, 2 and 6 -- These block diagrams use the
following designators in common.

encircled "X": cluster core

encircled "M*" (15 in Fig. 1): M-cluster

encircled "M" (unshaded; 201a, 201b and 201c in Fig. 2,
but 601a, 601b and 601c in Fig. 6): M-node (monitor-node),
powered

encircled "M" (shaded): M-node, unpowered (spare)

encircled "D" (13 in Fig. 1): D-node

encircled "C" (11 in Fig. 1): C-nodes

solid black circle with an associated tangential line (14
in Fig. 1): adapter-node (A-node)

solid black circle with an associated through-line (18 in
Fig. 1): adapter-port (A-port)

large bold circle (16 in Fig. 1; 202 in Fig. 2): M-bus

larger, fine circle (17 in Fig. 1; but 203 in Fig. 2):
A-lines

IP: inner-ring power

S in square: power switch

S3: set of S3-nodes.

Additional item in Fig. 1:

12 outer-ring bus

Additional items in Fig. 6:

602 IC-bus

603 disagree lines, internal-error lines, clock lines
and replacement-request lines.

(b) Fig. 3 -- The following explanations apply to the
n-bit comparator and switch. Section (1) of the drawing is
the symbol only; section (2) shows the detailed structure.

c is an n-bit self-checking comparator
d is a set of n tristate driver gates

if x = y, then e = 1 and f = x

if x ≠ y or if c indicates its own failure,
then e = 0 and f = Z (high impedance).
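
The behavior of the comparator and switch can be modeled in a few lines. This is a behavioral sketch only, not a description of the circuit: the function name, the use of bit tuples for the n-bit words, the `comparator_failed` flag and the use of `None` for the high-impedance state are assumptions made for illustration.

```python
HIGH_Z = None  # stands in for the high-impedance output state Z


def compare_and_switch(x, y, comparator_failed=False):
    """Return (e, f) for two n-bit words x and y given as bit tuples.

    e = 1 and f = x when the words match; otherwise, or when the
    self-checking comparator c reports its own failure, e = 0 and the
    tristate drivers d leave f at high impedance.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same width n")
    if comparator_failed or tuple(x) != tuple(y):
        return 0, HIGH_Z
    return 1, tuple(x)
```

Treating a comparator self-check failure exactly like a mismatch means a faulty comparator cannot let an unchecked word through to f.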



(c) Fig. 4 -- The following explanations apply to both
of Figs. 4a and 4b.

Internal Blocks:

401. Encoder
402. Encoder Register
403. ID Number for A-pair (ROM)
404. Comparator (self-checking)
405. Address Register
406. Decoder
407. Command Register
408. Sequencer
409. A-line Encoder & Sequencer
410. Majority Voter
411-414. Input Registers
415. Outer Ring Power Switch
416. Inner Ring Power Fuse

Outputs:

441a. Messages to C- (or D-) Node via CS 1
442a. Node Power On/Off Command via CS 2 (C- or D-node power)
443a. A-node Power Off Command to A-pair Fuse
444a. A-line to M-nodes (directly)
445a. Messages to M-nodes via CS 3 and the M-bus
446. Outer Ring Power (to C-node)

Inputs:

421a.-424a. From M-bus
425a. Inner Ring Clock
426a. Inner Ring Power (via Fuse)
427a. Power Switch Status
428a. Error Signal from CS 1
429a. Error Signal from CS 2
430a. Error Signal from CS 3
431a. Inputs from C- (or D-) node
432. Disagreement Signal from Voter
433. Message from C- (or D-) Node
434. Comparator Output
435. Command to Sequencer
450. Inner Ring Power
451. Outer Ring Power

Inputs for A-ports Only:

436a. Error Signal from CS 4
437a. Error Signal from CS 5
(these error signals are shown in Figure 9)


The Clock (425a), Power (426a) and Sequencer (408) out-
puts are connected to all internal blocks. To avoid clutter,
those connections are not shown.

Additional note: Elements 436a and 437a are on the
A-ports only.

Additional notes, Figure 4b:

(1) The A-nodes a and b, and all blocks shown here (except
the C-node), form one ASIC package.

(2) Inputs 443a or 443b permanently disconnect IR Power from 
an A-pair. 

(3) The input and output numbers refer to Fig. 4a. 



(d) Fig. 5 -- Below are explanations for Fig. 5. The
Clock (520), Power (533) and Sequencer (508) are connected to
all Internal Blocks. To avoid clutter, those connections are
not shown.

Internal Blocks:

501. IC-Bus Buffer Storage
502. System Time Register
503. M-Cluster Status Register
504. Outer Ring Status Register
505. ROM Response & Power-up Sequence Store
506. M-bus Buffer Store
507. Input Buffer Store
508. Sequencer (State Machine) and BIST
509. Output Buffer Store
510. A-line Input Buffer Store
511. Power Switch (controlled by k inputs from S3 nodes) that
works on the "summation" principle of three-valued in-
puts: the three possible values of s_i (i = 1, 2, ..., k)
are ON = +1, OFF = -1, tristate = 0.
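
The "summation" principle of the Power Switch (511) can be sketched as below. The key lists only the three input values; the decision rule used here (a positive sum switches power on, a negative sum switches it off, and a zero sum leaves the switch unchanged) is an assumption made for illustration, as are the function and parameter names.

```python
ON, OFF, TRISTATE = +1, -1, 0  # the three input values named in the key


def power_switch(inputs, currently_on):
    """Return the new switch state from the k control inputs s_1..s_k.

    Assumed rule: the sign of the sum decides, so a single faulty S3
    node cannot override a majority of healthy ones.
    """
    if any(s not in (ON, OFF, TRISTATE) for s in inputs):
        raise ValueError("each input must be +1, -1 or 0")
    total = sum(inputs)
    if total > 0:
        return True
    if total < 0:
        return False
    return currently_on  # assumed: a zero sum changes nothing
```

For example, with k = 4 inputs of which two request ON, one requests OFF and one is tristated, the sum is +1 and the switch is turned on under the assumed rule.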







Inputs:

520. Clock from S3 nodes
521. Power Switch Control from S3 nodes (k nodes)
522. M-Bus (n lines)
523.-524. A-lines from first A-pair
525.-526. A-lines from Nth A-pair (the total number of pairs
of A-lines is N)
527. "Disagree" signals from other M-nodes (4)
528. Internal or BIST error signals from other M-nodes
529. "Start BIST" command from S3 nodes
530. "Resume TMR" (or Duplex, or Simplex) commands from S3
531. "Power-Up Outer Ring" command from S3
532. IC-Bus (j lines)
533. Inner Ring Power (from switch)

Outputs:

540. to IC-Bus (j lines)
541. to M-Bus (n lines)
542. Power Switch Status to S3 nodes
543. Replacement Request to S3 nodes
544. Internal or BIST error to other M-nodes and S3 nodes
545. "Disagree" signal to other M-nodes and S3 nodes

(e) Fig. 7 -- The following explanations apply to Fig. 7
only. Outputs 721 through 730 are connected in a wired-"OR"
for all four S3 nodes.

Internal Blocks:

701. Fault-Tolerant Clock (one for both a and b sides),
connected to all Internal Blocks (connections not shown)
702a. System Time Counter
703a. Interval Timer (for power-off intervals)
704a. Outer Ring Status Register
705a. M-Cluster Status Register
706a. Sequencer (State Machine) with outputs to all Internal
Blocks (connections not shown)
707. Backup Power Source, common for a and b sides (connected
to all Internal Blocks, connections not shown)

Inputs:

710. Clock signals from 3 other S3 nodes
711. From IC-Bus (j lines)
712. Power Switch Status from M-nodes (5)
713. Internal or BIST error signals from M-nodes (5)
714. "Disagree" signals from M-nodes (5)
715. Replacement Request Signals from M-nodes (5)
716. Power-Off signal from critical event sensors (excessive
radiation, power instability, etc.) or from system
operator
717. Power-On signal (same sources as 716)
718. Primary Inner Ring Power (connected to all Internal
Blocks, connections not shown)

Outputs:

720. Clock signal to 3 other S3 nodes (connected to all
Internal Blocks, connections not shown)
721. System Time to IC-Bus
722. Interval Time to IC-Bus
723. Outer Ring Status to IC-Bus
724. M-Cluster Status to IC-Bus
725. "Start BIST" Command to M-nodes
726. "Resume TMR" (or Duplex, or Simplex) command to M-nodes
727. "Power Up Outer Ring" command to M-nodes
728. "M-Cluster is Dead" message to system operator
729. Power Switch control for all A-nodes
730. Power Switch control to M-nodes (5 lines)

(f) Fig. 8a -- At Start, only the S3-nodes are powered
and produce clock signals. There are 3 + n unpowered M-nodes,
where n is the number of spare M-nodes originally provided.
Figs. 2 and 6 show n = 2.

When the Power On sequence is carried out after a preced-
ing Power Off sequence, then the MC-SR contains a record of
the M-node status at the Power-Off time, and the M-nodes that
were powered then should be tested first.

(g) Fig. 8b -- The sequence is repeated for all A-pairs
until all C-nodes and D-nodes of the Outer Ring have been tes-
ted and the OR-SR (504) contains a complete record of their
status. The best sequence is to power on and test the D-nodes
first, followed by the top priority (operating system) C-
nodes, then the remaining C-nodes. If the number of powered C-
and D-nodes is limited, the remaining good nodes are powered
off after BIST and recorded as "Spare" in the OR-SR. The OR-SR
contents are also transferred to the S3 nodes at the end of
the sequence.
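
The Outer Ring power-on ordering described for Fig. 8b can be sketched as follows. This is an illustrative model, not the sequencer logic itself: the node records, the kind labels "D", "C-os" and "C", the BIST result table and the status strings "Powered" and "Failed" are hypothetical; only the "Spare" status and the ordering rule come from the description above.

```python
def power_on_order(nodes):
    """Order nodes for power-on and BIST: D-nodes first, then the
    top-priority (operating system) C-nodes, then the remaining C-nodes."""
    rank = {"D": 0, "C-os": 1, "C": 2}
    # sorted() is stable, so nodes of the same kind keep their given order.
    return sorted(nodes, key=lambda node: rank[node["kind"]])


def build_or_sr(nodes, bist_passed, max_powered):
    """Return {node id: status}, a stand-in for the OR-SR contents."""
    status = {}
    powered = 0
    for node in power_on_order(nodes):
        if not bist_passed[node["id"]]:
            status[node["id"]] = "Failed"
        elif powered < max_powered:
            status[node["id"]] = "Powered"
            powered += 1
        else:
            # Good node beyond the power budget: switched off after BIST.
            status[node["id"]] = "Spare"
    return status
```

Because the ordering is applied before the power budget is consumed, the D-nodes and operating-system C-nodes are the last to be left unpowered when the number of powered nodes is limited.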

(h) Fig. 8c -- This sequence is carried out when the
input 716 is received by the S3 nodes, i.e., when a catas-
trophic event is detected or when the DiSTARS is to be put
into a dormant state with only the S3 nodes in a powered con-
dition, with System Time (702a) and a power-off Interval Timer
(703a) being operated.

(i) Fig. 9 -- This D-pair replaces the C-node in Figure
4b to show how the A-ports are connected to the D-nodes. The
Twin D-nodes and their A-ports form one ASIC package. The
Outer Ring Power 446 and the Sequencer and Clock 901a are con-
nected to all Internal Blocks.

Internal Blocks:

901a. Sequencer and Clock
902a. Input Buffer Store
903a. Encoder of Messages to M-nodes (M-Cluster)
904a. Decision Algorithms: Exact and Inexact (N-Version)
Comparators and Voters
905a. Storage Array for D-node Logs
906a. Output Buffer Store
907a. Decoder of Messages from M-Cluster

Inputs:

426 Inner Ring power (via Fuse 416)
441 Messages from A-port to D-node
446 Outer Ring power (from Power Switch 415)
910 Decision Requests and Messages from C-nodes

Outputs:

431 Messages from D-node to M-nodes (via A-ports)
436 Error Signal from CS4
437 Error Signal from CS5
911 Decision Results and Messages to C-nodes
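
Block 904a lists inexact (N-version) voters alongside exact ones. One common way to vote on numeric outputs that legitimately differ slightly between versions is a median voter with a tolerance band; the sketch below shows that technique as an assumption and is not taken from this disclosure. The function name and the `tolerance` parameter are hypothetical.

```python
from statistics import median


def inexact_vote(values, tolerance):
    """Return (result, ok) for numeric outputs of N diverse versions.

    The result is the median of the outputs; the decision is accepted
    only if a strict majority of outputs lie within `tolerance` of it.
    """
    if len(values) < 2:
        raise ValueError("need at least two outputs to compare")
    mid = median(values)
    agreeing = [v for v in values if abs(v - mid) <= tolerance]
    ok = 2 * len(agreeing) > len(values)
    return (mid if ok else None), ok
```

With outputs 1.00, 1.01 and 5.0 and a tolerance of 0.05, the two close values form the majority and the outlier is outvoted; with outputs 1.0, 2.0 and 3.0 and a tolerance of 0.1 no majority exists and no result is returned.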

It will be understood that the foregoing disclosure is
intended to be merely exemplary, and not to limit the scope of
the invention, which is to be determined by reference to the
appended claims.